
IQ4_NL sgemm + Q4_0 AVX optimization #9422

Merged
4 commits merged into ggerganov:master on Sep 16, 2024

Conversation

@netrunnereve (Collaborator) commented on Sep 11, 2024

This contains two changes in one. The first is essentially a copy of my shelved #8049 (IQ4_NL sgemm), which I planned to resubmit after #9330 got merged. IQ4_NL is basically Q4_0 with a lookup table, so sgemm can easily be ported over to that quant.
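For reference, here's a rough scalar sketch (not the actual sgemm kernel) of why the port is straightforward: both quants store one fp16 scale plus 32 packed 4-bit values per block, and only the nibble-to-int8 mapping differs. Q4_0 uses a fixed linear mapping, while IQ4_NL indexes the 16-entry nonlinear codebook `kvalues_iq4nl` (values as defined in ggml-common.h, copied here for illustration).

```c
// Rough scalar illustration only -- not the sgemm code itself.
#include <stdint.h>

// IQ4_NL's 16-entry nonlinear codebook (as in ggml-common.h)
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

// Q4_0: linear mapping, nibble -> (nibble - 8) * d
static inline float dequant_q4_0(uint8_t nibble, float d) {
    return (float)((int)nibble - 8) * d;
}

// IQ4_NL: table lookup, nibble -> kvalues_iq4nl[nibble] * d
static inline float dequant_iq4_nl(uint8_t nibble, float d) {
    return (float)kvalues_iq4nl[nibble] * d;
}
```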

I was able to test on an AVX2 machine this time, so I've enabled this change for both AVX and AVX2. AVX2 is much faster due to #8908.

AVX2 (35% prompt processing improvement):

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU | 4 | pp512 | 7.01 ± 0.06 |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU | 4 | tg128 | 2.88 ± 0.02 |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 4 | pp512 | 9.46 ± 0.13 |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 4 | tg128 | 2.86 ± 0.08 |

AVX (10% prompt processing improvement):

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU | 8 | pp512 | 9.28 ± 0.03 |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU | 8 | tg128 | 6.85 ± 0.11 |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 8 | pp512 | 10.23 ± 0.04 |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 8 | tg128 | 6.96 ± 0.00 |

As our tests don't cover sgemm, I ran a 10-chunk Wikitext perplexity test with an IQ4_NL model and the numbers were within 0.2% of master. I also ran through some sample prompts and the model responded properly. If needed I can run with more chunks, but it's going to take forever on my slow computer.

The second change makes the Q4_0 ggml_vec_dot function compute two blocks at a time for regular AVX, just as is already done for IQ4_NL. This makes inference 7% faster.

From my testing this technique only helps with Q4_0 and doesn't do anything for Q8_0, which currently only calculates one block at a time. I think the eight loads (and hence eight registers) required to hold two Q8_0 blocks add way too much overhead.
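For anyone curious about the shape of the change, here's a simplified scalar sketch of the two-blocks-per-iteration structure. The struct and function names are mine, the scales are plain floats instead of ggml's fp16 fields, and the odd trailing block is omitted; the real code does the same restructuring with AVX intrinsics so the two blocks can share the nibble-unpacking work and the accumulator registers.

```c
// Structural sketch only: two Q4_0 blocks consumed per loop iteration,
// each with its own partial sum (the vectorized version keeps both blocks
// in registers and amortizes the unpack/sign-extend setup across them).
#include <stdint.h>

#define QK 32

typedef struct { float d; uint8_t qs[QK/2]; } blk_q4;  // simplified stand-in for block_q4_0
typedef struct { float d; int8_t  qs[QK];   } blk_q8;  // simplified stand-in for block_q8_0

static float vec_dot_q4_q8_2wide(int nblocks, const blk_q4 *x, const blk_q8 *y) {
    float acc0 = 0.0f, acc1 = 0.0f;             // two independent accumulators
    for (int i = 0; i + 1 < nblocks; i += 2) {  // two blocks per iteration
        int s0 = 0, s1 = 0;
        for (int j = 0; j < QK/2; ++j) {
            // Q4_0 layout: low nibble holds element j, high nibble element j + QK/2
            s0 += ((x[i  ].qs[j] & 0x0F) - 8) * y[i  ].qs[j]
                + ((x[i  ].qs[j] >>   4) - 8) * y[i  ].qs[j + QK/2];
            s1 += ((x[i+1].qs[j] & 0x0F) - 8) * y[i+1].qs[j]
                + ((x[i+1].qs[j] >>   4) - 8) * y[i+1].qs[j + QK/2];
        }
        acc0 += s0 * x[i  ].d * y[i  ].d;
        acc1 += s1 * x[i+1].d * y[i+1].d;
    }
    return acc0 + acc1;  // odd trailing block omitted for brevity
}
```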

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 (Master) | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 10.21 ± 0.00 |
| llama 8B Q4_0 (Master) | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 6.57 ± 0.01 |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 10.22 ± 0.05 |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.02 ± 0.03 |

test-quantize-fns and test-backend-ops are passing for this PR.

P.S. I saw that F16C was used in #8908 and wanted to see if that worked for inference as well, so I modified the IQ4_NL ggml_vec_dot function to convert and multiply the scales for four blocks at a time. Sadly that didn't have a visible performance impact, so I removed it from this PR, though my code can be found in a201c6b if anyone wants to experiment with it.
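The gist of that experiment (a paraphrase, not the exact code in a201c6b, and with a hypothetical helper name) was to widen four fp16 block scales at once with F16C instead of converting them one at a time:

```c
// Hedged sketch of the idea: with F16C, four fp16 scales become fp32 in a single
// _mm_cvtph_ps, and the x*y scale products for four consecutive blocks come from
// one _mm_mul_ps. Build with -mf16c (or a -march that includes F16C).
#include <immintrin.h>
#include <stdint.h>

// x_d[k] / y_d[k] are the raw fp16 bit patterns of the scales of blocks i .. i+3
// (in the real dot product they'd be gathered from the block structs first).
static inline __m128 scale_products_x4(const uint16_t x_d[4], const uint16_t y_d[4]) {
    const __m128i xh = _mm_loadl_epi64((const __m128i *)x_d);  // 4 x fp16 in low 64 bits
    const __m128i yh = _mm_loadl_epi64((const __m128i *)y_d);
    const __m128  xf = _mm_cvtph_ps(xh);                       // F16C: 4 x fp16 -> fp32
    const __m128  yf = _mm_cvtph_ps(yh);
    return _mm_mul_ps(xf, yf);                                 // d_x * d_y for the 4 blocks
}
```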

readd my iq4_nl sgemm PR #8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per #8549 we can calculate several blocks at a time with no issue
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 11, 2024
@ggerganov merged commit 5c3d0f1 into ggerganov:master on Sep 16, 2024
52 checks passed
@netrunnereve deleted the avx_optimizations branch on September 16, 2024 18:35
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* squashed

readd my iq4_nl sgemm PR ggerganov#8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per ggerganov#8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as i cant make it faster than before
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024 (same squashed commit message as above)
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024 (same squashed commit message as above)